# This is a so-called "R chunk" where you can write R code.
date()
## [1] "Wed Nov 24 11:01:03 2021"
My name is Dongming Zhang and I am a doctoral student at the University of Helsinki. My project concerns the genetic study of fungi such as Aspergillus niger and Trichoderma reesei.
To be honest, I have some experience with R, Python and Matlab. However, I am not a statistics student by training: my programming ability is limited to running scripts and writing simple functions, and there are many programming questions and underlying problems that I do not yet understand clearly. My supervisor, from the Department of Microbiology, recommended this course to me. I would therefore like to learn R in detail in this course, so that I can use it proficiently to analyse the bioinformatic data in my project.
date()
## [1] "Wed Nov 24 11:01:03 2021"
students2014 <- read.table("./data/learning2014.txt", sep=",", header=TRUE, stringsAsFactors = T)
str(students2014)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
dim(students2014)
## [1] 166 7
This data set is the result of a survey on teaching and learning, collected in 2014. It includes background information on the participants, such as ‘gender’, ‘age’ and ‘attitude’. In the data set, ‘deep’ is a student’s mean score on the 12 deep-learning questions, ‘stra’ the mean score on the 12 strategic-learning questions, and ‘surf’ the mean score on the 12 surface-learning questions; ‘points’ records the students’ exam scores. The data frame has 166 observations of 7 variables.
library(ggplot2)
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
pairs(students2014[2:7], col = students2014$gender)
ggpairs(students2014, lower = list(combo = wrap("facethist", bins = 20)))
The two figures above show the relationships among the 7 variables. In the ggpairs plot, significant correlations are tagged with asterisks ‘*’. All variables are roughly normally distributed except ‘age’: most survey participants are aged between 20 and 30. For ‘surf’, the correlations with ‘attitude’, ‘deep’ and ‘stra’ are significant, and ‘deep’ has the largest absolute correlation with ‘surf’, 0.324, so it has the strongest influence on ‘surf’. With regard to ‘points’, ‘attitude’ has the highest absolute correlation, 0.437, and thus the most significant relationship with ‘points’.
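The correlations read off the ggpairs plot can be reproduced directly with cor() and cor.test(). A minimal sketch on synthetic stand-in data (the real students2014 survey is not reproduced here, so the numbers are illustrative only):

```r
# Minimal sketch: reproducing a ggpairs-style correlation test in base R.
# 'toy' is synthetic stand-in data, not the real learning2014 survey.
set.seed(1)
toy <- data.frame(attitude = runif(166, min = 1, max = 5))
toy$points <- 10 + 3.4 * toy$attitude + rnorm(166, sd = 5)

r  <- cor(toy$attitude, toy$points)       # Pearson correlation coefficient
ct <- cor.test(toy$attitude, toy$points)  # adds the significance test
r
ct$p.value                                # small p-value -> tagged '*' in ggpairs
```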
my_model <- lm(points ~ attitude + stra + surf, data = students2014)
summary(my_model)
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = students2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## attitude 3.3952 0.5741 5.913 1.93e-08 ***
## stra 0.8531 0.5416 1.575 0.11716
## surf -0.5861 0.8014 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
In this part, ‘attitude’, ‘stra’ and ‘surf’ are used as explanatory variables in the linear regression because they have the three highest absolute correlations with the target variable ‘points’. From the model summary, only ‘attitude’ has a statistically significant relationship with ‘points’; ‘stra’ and ‘surf’ are not significantly related to it. The intercept is also significant in the model. Meanwhile, the multiple R-squared is relatively low, 0.2074, meaning that the explanatory variables account for only about 21% of the variation in ‘points’.
optimized_model <- lm(points ~ attitude, data = students2014)
summary(optimized_model)
##
## Call:
## lm(formula = points ~ attitude, data = students2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
Using only the most significant variable, ‘attitude’, as the explanatory variable in ‘optimized_model’, the residuals do not change much compared with ‘my_model’, because ‘attitude’ was already the dominant factor. ‘attitude’ still has a highly significant relationship with ‘points’. The multiple R-squared decreases from 0.2074 in ‘my_model’ to 0.1906 in ‘optimized_model’, so the regression still explains only a small share of the variation in ‘points’; as a rough rule of thumb, a well-fitting model often has a multiple R-squared above 0.7.
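A side note on the R-squared comparison: for a single-predictor model like ‘optimized_model’, the multiple R-squared is exactly the squared Pearson correlation between response and predictor. A quick check on R’s built-in cars data (used here only because the course data is not bundled with R):

```r
# For simple linear regression, Multiple R-squared == cor(y, x)^2.
# Demonstrated on the built-in 'cars' data as a stand-in for students2014.
fit        <- lm(dist ~ speed, data = cars)
r2_summary <- summary(fit)$r.squared
r2_manual  <- cor(cars$dist, cars$speed)^2
all.equal(r2_summary, r2_manual)  # TRUE
```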
par(mfrow = c(2,2))
plot(optimized_model, which = c(1,2,5))
‘optimized_model’ is used in the model-validation part to check the model assumptions. The basic assumptions about the errors are that they are normally distributed with constant variance and are uncorrelated, and that the size of a given error does not depend on the explanatory variables. According to the Residuals vs Fitted plot (top left), the constant-variance assumption is reasonable: the points are scattered randomly, indicating that the errors do not depend on the explanatory variables. The Normal Q-Q plot shows a reasonable fit, with the points lying close to the reference line, so the errors can be considered normally distributed. In the Residuals vs Leverage plot, a few observations have relatively high leverage, roughly between 0.02 and 0.04 on the x-axis, but none appear seriously influential. Overall, the model assumptions are reasonably well satisfied. However, this model is not well suited to predicting new data, because the multiple R-squared is low and it contains few relevant explanatory variables; more relevant data would be needed for further analysis.
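The graphical diagnostics can also be checked numerically: hatvalues() returns the leverages plotted on the x-axis of the Residuals vs Leverage panel, and cooks.distance() quantifies each observation’s influence. A sketch on the built-in cars data (a stand-in, since the course data is not bundled with R):

```r
# Numeric counterparts of the diagnostic plots, on the built-in 'cars' data.
fit  <- lm(dist ~ speed, data = cars)
lev  <- hatvalues(fit)       # leverage of each observation
cook <- cooks.distance(fit)  # influence of each observation
head(sort(lev,  decreasing = TRUE), 3)   # highest-leverage observations
head(sort(cook, decreasing = TRUE), 3)   # most influential observations
any(cook > 1)  # common rule of thumb: Cook's distance > 1 merits a closer look
```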
date()
## [1] "Wed Nov 24 11:01:14 2021"
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(ggpubr)
library(tidyr)
library(GGally)
library(boot)
setwd('//ad.helsinki.fi/home/z/zhangdon/Desktop/Introduction to open data science/IODS-project/data')
alc <- read.table("alc.txt", sep=",", header=TRUE, stringsAsFactors = T)
str(alc)
## 'data.frame': 370 obs. of 50 variables:
## $ school : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
## $ sex : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
## $ age : int 18 17 15 15 16 16 16 17 15 15 ...
## $ address : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
## $ famsize : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
## $ Pstatus : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
## $ Medu : int 4 1 1 4 3 4 2 4 3 3 ...
## $ Fedu : int 4 1 1 2 3 3 2 4 2 4 ...
## $ Mjob : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
## $ Fjob : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
## $ reason : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
## $ guardian : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
## $ traveltime : int 2 1 1 1 1 1 1 2 1 1 ...
## $ studytime : int 2 2 2 3 2 2 2 2 2 2 ...
## $ failures.math: int 0 0 3 0 0 0 0 0 0 0 ...
## $ schoolsup : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
## $ famsup : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
## $ paid.math : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 2 ...
## $ activities : Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
## $ nursery : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
## $ higher : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ internet : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
## $ romantic : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ famrel : int 4 5 4 3 4 5 4 4 4 5 ...
## $ freetime : int 3 3 3 2 3 4 4 1 2 5 ...
## $ goout : int 4 3 2 2 2 2 4 4 2 1 ...
## $ Dalc : int 1 1 2 1 1 1 1 1 1 1 ...
## $ Walc : int 1 1 3 1 2 2 1 1 1 1 ...
## $ health : int 3 3 3 5 5 5 3 1 1 5 ...
## $ absences.math: int 6 4 10 2 4 10 0 6 0 0 ...
## $ G1.math : int 5 5 7 15 6 15 12 6 16 14 ...
## $ G2.math : int 6 5 8 14 10 15 12 5 18 15 ...
## $ G3.math : int 6 6 10 15 10 15 11 6 19 15 ...
## $ id.math : int 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 ...
## $ failures.por : int 0 0 0 0 0 0 0 0 0 0 ...
## $ paid.por : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ absences.por : int 4 2 6 0 0 6 0 2 0 0 ...
## $ G1.por : int 0 9 12 14 11 12 13 10 15 12 ...
## $ G2.por : int 11 11 13 14 13 12 12 13 16 12 ...
## $ G3.por : int 11 11 12 14 13 13 13 13 17 13 ...
## $ id.por : int 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 ...
## $ failures : int 0 0 2 0 0 0 0 0 0 0 ...
## $ paid : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 2 ...
## $ absences : int 5 3 8 1 2 8 0 4 0 0 ...
## $ G1 : int 2 7 10 14 8 14 12 8 16 13 ...
## $ G2 : int 8 8 10 14 12 14 12 9 17 14 ...
## $ G3 : int 8 8 11 14 12 14 12 10 18 14 ...
## $ alc_use : num 1 1 2.5 1 1.5 1.5 1 1 1 1 ...
## $ high_use : logi FALSE FALSE TRUE FALSE FALSE FALSE ...
## $ cid : int 3001 3002 3003 3004 3005 3006 3007 3008 3009 3010 ...
dim(alc)
## [1] 370 50
print(colnames(alc))
## [1] "school" "sex" "age" "address"
## [5] "famsize" "Pstatus" "Medu" "Fedu"
## [9] "Mjob" "Fjob" "reason" "guardian"
## [13] "traveltime" "studytime" "failures.math" "schoolsup"
## [17] "famsup" "paid.math" "activities" "nursery"
## [21] "higher" "internet" "romantic" "famrel"
## [25] "freetime" "goout" "Dalc" "Walc"
## [29] "health" "absences.math" "G1.math" "G2.math"
## [33] "G3.math" "id.math" "failures.por" "paid.por"
## [37] "absences.por" "G1.por" "G2.por" "G3.por"
## [41] "id.por" "failures" "paid" "absences"
## [45] "G1" "G2" "G3" "alc_use"
## [49] "high_use" "cid"
This data set is the result of joining the Math course and Portuguese language course data sets. To examine alcohol consumption against the other variables, alc_use is defined as the average of workday and weekend alcohol consumption, and high_use is TRUE when alc_use is higher than 2, otherwise FALSE. As for the abbreviations: famrel is the quality of family relationships, freetime the amount of free time after school, and goout how often the student goes out with friends; their values range from 1 (very low) to 5 (very high). famsup is a yes/no variable indicating whether the student receives family educational support.
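The derivation of alc_use and high_use described above can be sketched in base R on a hypothetical three-row data frame (the real Dalc and Walc columns come from the merged alc data; the actual wrangling used dplyr’s mutate):

```r
# Toy illustration of how alc_use and high_use are derived.
toy <- data.frame(Dalc = c(1, 2, 5),   # workday alcohol consumption (1-5)
                  Walc = c(1, 3, 5))   # weekend alcohol consumption (1-5)
toy$alc_use  <- (toy$Dalc + toy$Walc) / 2  # average of the two
toy$high_use <- toy$alc_use > 2            # TRUE when the average exceeds 2
toy
#   Dalc Walc alc_use high_use
# 1    1    1     1.0    FALSE
# 2    2    3     2.5     TRUE
# 3    5    5     5.0     TRUE
```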
Initially, it is hypothesized that family and entertainment are the key factors affecting students’ alcohol consumption. Thus famrel, famsup, freetime and goout are chosen for the logistic regression analysis.
gather(alc) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()
From the key-value-pair plots, most of the students are aged 15 to 18. Since the legal drinking age is 18 or higher in many countries, high alcohol consumption among those under 18 is likely to be strongly influenced by their family or their peers.
famrel1 <- ggplot(alc, aes(x = high_use, y = famrel, col = sex)) + geom_boxplot()
famsup1 <- ggplot(alc, aes(x = high_use, y = famsup, col = sex)) + geom_boxplot()
freetime1 <- ggplot(alc, aes(x = high_use, y = freetime, col = sex)) + geom_boxplot()
goout1 <- ggplot(alc, aes(x = high_use, y = goout, col = sex)) + geom_boxplot()
ggarrange(famrel1, famsup1, freetime1, goout1,
ncol = 2, nrow = 2)
Comparing family relationships with high_use, most high-use students nevertheless report good relationships with their family members. Judging from the boxplots alone, boys have more free time than girls, which may give them more opportunity to drink. Students who go out with friends more often per week also tend to drink more. Because family educational support is a yes/no variable, its boxplot is not very informative.
The numerical analysis below relates the selected variables to high alcohol consumption; overall, most students do not drink heavily. Looking at the largest high-use counts in each of the four dimensions: the largest high-use group by family relationship is the 52 students at level 4. The number of high-use students with family educational support is 66, higher than the 45 without it. For free time, the high-use counts at levels 3 and 4 are equal, 40 each. For going out, the largest high-use count is 38 at level 4, whereas the largest non-high-use count is 97 at level 3. These numerical and graphical summaries alone do not establish any relationship between the variables and alcohol consumption, so logistic regression is run to provide more information.
alc %>% group_by(famrel, high_use) %>% summarise(count = n())
## `summarise()` has grouped output by 'famrel'. You can override using the `.groups` argument.
## # A tibble: 10 x 3
## # Groups: famrel [5]
## famrel high_use count
## <int> <lgl> <int>
## 1 1 FALSE 6
## 2 1 TRUE 2
## 3 2 FALSE 9
## 4 2 TRUE 9
## 5 3 FALSE 39
## 6 3 TRUE 25
## 7 4 FALSE 128
## 8 4 TRUE 52
## 9 5 FALSE 77
## 10 5 TRUE 23
alc %>% group_by(famsup, high_use) %>% summarise(count = n())
## `summarise()` has grouped output by 'famsup'. You can override using the `.groups` argument.
## # A tibble: 4 x 3
## # Groups: famsup [2]
## famsup high_use count
## <fct> <lgl> <int>
## 1 no FALSE 94
## 2 no TRUE 45
## 3 yes FALSE 165
## 4 yes TRUE 66
alc %>% group_by(freetime, high_use) %>% summarise(count = n())
## `summarise()` has grouped output by 'freetime'. You can override using the `.groups` argument.
## # A tibble: 10 x 3
## # Groups: freetime [5]
## freetime high_use count
## <int> <lgl> <int>
## 1 1 FALSE 15
## 2 1 TRUE 2
## 3 2 FALSE 46
## 4 2 TRUE 14
## 5 3 FALSE 112
## 6 3 TRUE 40
## 7 4 FALSE 65
## 8 4 TRUE 40
## 9 5 FALSE 21
## 10 5 TRUE 15
alc %>% group_by(goout, high_use) %>% summarise(count = n())
## `summarise()` has grouped output by 'goout'. You can override using the `.groups` argument.
## # A tibble: 10 x 3
## # Groups: goout [5]
## goout high_use count
## <int> <lgl> <int>
## 1 1 FALSE 19
## 2 1 TRUE 3
## 3 2 FALSE 82
## 4 2 TRUE 15
## 5 3 FALSE 97
## 6 3 TRUE 23
## 7 4 FALSE 40
## 8 4 TRUE 38
## 9 5 FALSE 21
## 10 5 TRUE 32
m <- glm(high_use ~ famrel + famsup + freetime + goout, data = alc, family = "binomial")
summary(m)
##
## Call:
## glm(formula = high_use ~ famrel + famsup + freetime + goout,
## family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7364 -0.7799 -0.5596 0.9979 2.4376
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.1950 0.6982 -3.144 0.00167 **
## famrel -0.4191 0.1371 -3.056 0.00224 **
## famsupyes -0.1900 0.2542 -0.747 0.45482
## freetime 0.2014 0.1364 1.477 0.13966
## goout 0.7405 0.1233 6.006 1.9e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 391.89 on 365 degrees of freedom
## AIC: 401.89
##
## Number of Fisher Scoring iterations: 4
coef(m)
## (Intercept) famrel famsupyes freetime goout
## -2.1949968 -0.4191495 -0.1899943 0.2014051 0.7405345
From the model summary, family relationships and going out significantly influence high alcohol consumption among students; going out is the most significant factor (p = 1.9e-09). Family educational support and free time are not significant. The coefficients returned by coef() match the estimates in the summary. famrel and famsupyes have negative coefficients (-0.4191 and -0.1900), meaning they are negatively associated with the target variable (high alcohol consumption), whereas freetime and goout are positively associated with it (0.2014 and 0.7405, respectively).
OR <- coef(m) %>% exp
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.1113589 0.0272651 0.4247412
## famrel 0.6576059 0.5006796 0.8588025
## famsupyes 0.8269639 0.5027027 1.3646341
## freetime 1.2231201 0.9373983 1.6021304
## goout 2.0970561 1.6575495 2.6907078
The odds ratios are obtained by exponentiating the coefficients. Values larger than 1 indicate a positive association, values below 1 a negative one. For the family factors, each additional point of family-relationship quality reduces the odds of high alcohol consumption by about 34.2% (1 - 0.658), and students with family educational support have about 17.3% (1 - 0.827) lower odds of heavy drinking than students without it. For the entertainment factors, each additional point of free time multiplies the odds of high consumption by about 1.223, and each additional point of going out by about 2.097. The table also reports a 95% confidence interval (from the 2.5% to the 97.5% quantile) for each odds ratio.
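The percentage interpretations above come from simple arithmetic on the exponentiated coefficients; using the estimates reported in the model summary:

```r
# Odds ratios = exp(coefficients); estimates copied from the summary above.
coefs <- c(famrel = -0.4191, famsupyes = -0.1900,
           freetime = 0.2014, goout = 0.7405)
OR <- exp(coefs)
round(OR, 3)
# OR < 1 means lower odds: one extra famrel point multiplies the odds of
# high use by ~0.66, i.e. reduces them by roughly 34%.
round((1 - OR[["famrel"]]) * 100, 1)
```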
Next, to examine the coefficient of each factor level directly, the intercept is removed from the model below. In this parameterization both levels of family educational support appear significant, but that mainly reflects the intercept being absorbed into the famsup dummies rather than a real effect of famsup itself. Free time, however, is still not a significant factor in the model, so its statistical significance should be checked again.
m1 <- glm(high_use ~ famrel + famsup + freetime + goout - 1, data = alc, family = "binomial")
summary(m1)
##
## Call:
## glm(formula = high_use ~ famrel + famsup + freetime + goout -
## 1, family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7364 -0.7799 -0.5596 0.9979 2.4376
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## famrel -0.4191 0.1371 -3.056 0.002241 **
## famsupno -2.1950 0.6982 -3.144 0.001668 **
## famsupyes -2.3850 0.6818 -3.498 0.000469 ***
## freetime 0.2014 0.1364 1.477 0.139657
## goout 0.7405 0.1233 6.006 1.9e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 512.93 on 370 degrees of freedom
## Residual deviance: 391.89 on 365 degrees of freedom
## AIC: 401.89
##
## Number of Fisher Scoring iterations: 4
coef(m1)
## famrel famsupno famsupyes freetime goout
## -0.4191495 -2.1949968 -2.3849911 0.2014051 0.7405345
m2 <- glm(high_use ~ famrel + famsup + goout, data = alc, family = "binomial")
m3 <- glm(high_use ~ famrel + freetime +goout, data = alc, family = "binomial")
anova(m, m2, test = 'LRT')
## Analysis of Deviance Table
##
## Model 1: high_use ~ famrel + famsup + freetime + goout
## Model 2: high_use ~ famrel + famsup + goout
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 365 391.89
## 2 366 394.09 -1 -2.1989 0.1381
anova(m, m3, test = 'LRT')
## Analysis of Deviance Table
##
## Model 1: high_use ~ famrel + famsup + freetime + goout
## Model 2: high_use ~ famrel + freetime + goout
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 365 391.89
## 2 366 392.45 -1 -0.55693 0.4555
Comparing the models with and without free time, the likelihood-ratio test gives p = 0.1381, so free time is not significant in the model. Similarly, comparing the models with and without family support, the likelihood-ratio test gives p = 0.4555. Free time and family support are therefore not significant factors for high alcohol consumption among students, and both can be removed from the logistic regression model to improve prediction accuracy.
probabilities <- predict(m, type = "response")
alc0 <- mutate(alc, probability = probabilities, prediction = probabilities > 0.5)
g <- ggplot(alc0, aes(x = probability, y = high_use, col = prediction))
g + geom_point()
First, the initial four variables (family relationships, family educational support, free time and going out) are used for the logistic regression, and the whole data set is used to validate the predictions. From the plot, the false negatives appear more numerous than the true positives, so the exact prediction results should be examined with the table function.
select(alc0, age, freetime, failures, goout, high_use, probability, prediction) %>% tail(10)
## age freetime failures goout high_use probability prediction
## 361 17 4 0 3 FALSE 0.30061496 FALSE
## 362 19 3 1 2 FALSE 0.14352564 FALSE
## 363 18 4 1 3 TRUE 0.22036842 FALSE
## 364 18 3 0 3 FALSE 0.18771510 FALSE
## 365 18 4 0 3 FALSE 0.26223864 FALSE
## 366 19 4 1 2 FALSE 0.11877781 FALSE
## 367 18 3 0 4 FALSE 0.37866087 FALSE
## 368 18 1 0 1 FALSE 0.15813018 FALSE
## 369 17 4 0 5 TRUE 0.81381569 TRUE
## 370 18 4 0 1 TRUE 0.08903782 FALSE
table(high_use = alc0$high_use, prediction = alc0$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 237 22
## TRUE 72 39
table(high_use = alc0$high_use, prediction = alc0$prediction) %>% prop.table() %>% addmargins()
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.64054054 0.05945946 0.70000000
## TRUE 0.19459459 0.10540541 0.30000000
## Sum 0.83513514 0.16486486 1.00000000
As shown by the two tables above, the prediction accuracy is (true negatives + true positives) divided by the total number of students. Conveniently, prop.table has already computed these proportions: the accuracy of the model is 0.6405 + 0.1054 = 0.7459, i.e. the model classifies 74.59% of the students correctly.
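The same accuracy can also be computed from the raw counts in the cross-tabulation, which makes the (TN + TP) / n formula explicit:

```r
# Accuracy from the 2x2 confusion matrix; counts taken from the table above.
conf <- matrix(c(237, 72, 22, 39), nrow = 2,
               dimnames = list(high_use   = c("FALSE", "TRUE"),
                               prediction = c("FALSE", "TRUE")))
accuracy <- sum(diag(conf)) / sum(conf)  # (true negatives + true positives) / n
round(accuracy, 4)
# [1] 0.7459
```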
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
loss_func(class = alc0$high_use, prob = alc0$probability)
## [1] 0.2540541
The loss function is a simple way to validate predictions. As shown above, the loss is 0.2541, so the accuracy is 1 - 0.2541 = 0.7459, consistent with the value calculated by hand from the table results.
From the analysis in the previous step, free time is not significant and can be removed from the logistic regression model; family educational support is not significant either, according to the likelihood-ratio tests. Therefore two variables (family relationships and going out) are used for the model, and the whole data set is used for prediction validation.
m4 <- glm(high_use ~ famrel + goout, data = alc, family = "binomial")
probabilities1 <- predict(m4, type = "response")
alc1 <- mutate(alc, probability = probabilities1, prediction = probabilities1 > 0.5)
loss_func(class = alc1$high_use, prob = alc1$probability)
## [1] 0.2432432
As we can see, the loss of model m4 decreases compared with the four-variable model m, from 0.2541 to 0.2432: keeping only the significant factors improves the prediction accuracy.
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m4, K = 10)
cv$delta[1]
## [1] 0.2513514
After 10-fold cross-validation, the model appears to perform slightly worse than when validated on the whole data: the cross-validated loss (0.2514) is somewhat higher than the training loss (0.2432). Because the data are split into 10 parts for training and testing, the cross-validated estimate is the more realistic measure of prediction error, even though it is slightly higher.
crossvalidation <- function(model){
cv <- cv.glm(data = alc, cost = loss_func, glmfit = model, K = 10)
cv$delta[1]
}
accuracy <- sapply(1:100, function(n)crossvalidation(m4))
accuracy
## [1] 0.2621622 0.2540541 0.2594595 0.2648649 0.2675676 0.2486486 0.2540541
## [8] 0.2567568 0.2594595 0.2648649 0.2594595 0.2486486 0.2567568 0.2513514
## [15] 0.2621622 0.2621622 0.2567568 0.2621622 0.2540541 0.2567568 0.2540541
## [22] 0.2567568 0.2486486 0.2675676 0.2621622 0.2486486 0.2432432 0.2459459
## [29] 0.2621622 0.2567568 0.2648649 0.2648649 0.2567568 0.2486486 0.2540541
## [36] 0.2513514 0.2459459 0.2648649 0.2432432 0.2513514 0.2513514 0.2513514
## [43] 0.2459459 0.2459459 0.2567568 0.2513514 0.2486486 0.2540541 0.2567568
## [50] 0.2567568 0.2648649 0.2540541 0.2567568 0.2648649 0.2594595 0.2621622
## [57] 0.2594595 0.2567568 0.2513514 0.2513514 0.2567568 0.2459459 0.2459459
## [64] 0.2540541 0.2513514 0.2594595 0.2567568 0.2540541 0.2594595 0.2567568
## [71] 0.2513514 0.2459459 0.2594595 0.2513514 0.2648649 0.2594595 0.2648649
## [78] 0.2567568 0.2648649 0.2540541 0.2567568 0.2594595 0.2567568 0.2432432
## [85] 0.2540541 0.2567568 0.2513514 0.2594595 0.2621622 0.2540541 0.2594595
## [92] 0.2540541 0.2621622 0.2702703 0.2567568 0.2540541 0.2648649 0.2594595
## [99] 0.2648649 0.2594595
v <- data.frame(accuracy)
ggplot(v, aes(x = as.numeric(row.names(v)), y = accuracy, group = as.numeric(row.names(v)))) +
geom_point(aes(color = ifelse(accuracy>0.2432432, 'red', 'blue'))) +
scale_colour_manual(labels = c("<0.2432432", ">0.2432432"), values=c('red', 'blue')) +
labs(color = "Values") +labs(x = 'cycles of cross validation', y = 'Loss of models')
Running 100 rounds of cross-validation for the model, only a few of the cross-validated losses fall below the whole-data loss. This demonstrates that the prediction accuracy estimated by cross-validation varies within a range, due to the random partitioning into training and testing sets.
To examine how the number of predictors affects the model, ten variables are first selected somewhat arbitrarily and predictors are then removed one at a time, down to a single variable. The cross-validation loss is calculated for each of these models.
# generate the models with different numbers of predictors
mm10 <- glm(high_use ~ famrel + famsup + freetime + goout + sex + age + higher + absences + failures + activities, data = alc, family = "binomial")
mm9 <- glm(high_use ~ famrel + famsup + freetime + goout + sex + age + higher + absences + failures, data = alc, family = "binomial")
mm8 <- glm(high_use ~ famrel + famsup + freetime + goout + sex + age + higher + absences, data = alc, family = "binomial")
mm7 <- glm(high_use ~ famrel + famsup + freetime + goout + sex + age + higher, data = alc, family = "binomial")
mm6 <- glm(high_use ~ famrel + famsup + freetime + goout + sex + age, data = alc, family = "binomial")
mm5 <- glm(high_use ~ famrel + famsup + freetime + goout + sex, data = alc, family = "binomial")
mm4 <- glm(high_use ~ famrel + famsup + freetime + goout, data = alc, family = "binomial")
mm3 <- glm(high_use ~ famrel + famsup + freetime, data = alc, family = "binomial")
mm2 <- glm(high_use ~ famrel + famsup, data = alc, family = "binomial")
mm1 <- glm(high_use ~ famrel, data = alc, family = "binomial")
cv10 <- c(crossvalidation(mm10), crossvalidation(mm9), crossvalidation(mm8), crossvalidation(mm7), crossvalidation(mm6), crossvalidation(mm5), crossvalidation(mm4), crossvalidation(mm3), crossvalidation(mm2), crossvalidation(mm1))
No_predictor <- 10:1
supur_bonus <- data.frame(No_predictor, cv10)
ggplot(supur_bonus, aes(x = No_predictor, y = cv10)) + geom_line() + labs(x = 'Numbers of predictors', y = 'Loss of models') + xlim(10.5, 0.5)
According to the trend above, the fewer the predictors, the higher the loss of the model, so prediction accuracy depends on the number of variables in the model. However, more variables do not automatically mean better accuracy: insignificant predictors can decrease the accuracy of a logistic regression model. Selecting all of the genuinely significant variables would yield the most accurate predictions.
date()
## [1] "Wed Nov 24 11:01:42 2021"
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(corrplot)
## corrplot 0.91 loaded
library(tidyr)
library(PerformanceAnalytics)
## Loading required package: xts
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
##
## Attaching package: 'xts'
## The following objects are masked from 'package:dplyr':
##
## first, last
##
## Attaching package: 'PerformanceAnalytics'
## The following object is masked from 'package:graphics':
##
## legend
library(GGally)
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
data('Boston')
The Boston data set has 506 observations of 14 variables and describes housing values in the suburbs of Boston. crim is the per capita crime rate by town; zn is the proportion of residential land zoned for lots over 25,000 sq.ft; indus is the proportion of non-retail business acres per town. The full variable descriptions can be found in the MASS documentation ([click here](https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/Boston.html)).
cor_matrix<-cor(Boston) %>% round(digits = 2)
corrplot(cor_matrix, method="circle", type="upper", cl.pos="b", tl.pos="d", tl.cex = 0.6)
chart.Correlation(Boston, histogram=TRUE, pch=19)
As shown in the two figures above, correlations with p-value < 0.01 are considered significant. In the first figure the circles represent significant correlations between pairs of variables, while insignificant ones are left blank; blue circles are positive correlations and red ones negative. From the second figure, most variables are significantly correlated with each other, except chas (the Charles River dummy variable), which shows little relationship to the other variables in the correlation matrix.
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
From the summary, the variables differ greatly in scale. For example, the crime rate ranges from 0.00632 to 88.97620, whereas rm (average number of rooms per dwelling) has a narrow range, from 3.561 to 8.780, and tax takes relatively large values compared to the other variables. Thus, it is necessary to standardize the whole dataset.
boston_scaled <- scale(Boston)
print(summary(boston_scaled))
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
Standardization is an important step before clustering: \(scaled(x) = \frac{x - mean(x)}{sd(x)}\). After scaling, all variables are brought onto a comparable scale, which prevents variables with larger scales from outweighing the others.
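That formula can be checked with a small Python sketch (illustrative only; R's scale() does the same per column, using the sample standard deviation with an n - 1 denominator):

```python
# z-score standardization: z = (x - mean(x)) / sd(x), sample sd as in R.
def scale_column(x):
    n = len(x)
    m = sum(x) / n
    sd = (sum((v - m) ** 2 for v in x) / (n - 1)) ** 0.5
    return [(v - m) / sd for v in x]

col = [2.0, 4.0, 6.0, 8.0]        # toy column, not real Boston data
z = scale_column(col)
print(round(sum(z) / len(z), 10))  # mean of the scaled column is 0
```

After scaling, every column has mean 0 and standard deviation 1, exactly as the summary of `boston_scaled` above shows.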
# creating a categorical variable of the crime rate
boston_scaled <- as.data.frame(boston_scaled)
bins <- quantile(boston_scaled$crim)
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
table(crime)
## crime
## low med_low med_high high
## 127 126 126 127
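The binning above uses the quantile break points of the scaled crime rate, so the four classes come out (almost) equally sized. A Python sketch of the same cut logic, with hypothetical break points standing in for the actual quantiles:

```python
import bisect

# Assign each value to a right-closed interval (breaks[i], breaks[i+1]],
# with the minimum included in the first bin (include.lowest = TRUE style).
def cut(values, breaks, labels):
    out = []
    for v in values:
        i = bisect.bisect_left(breaks, v, 1, len(breaks) - 1)
        out.append(labels[i - 1])
    return out

breaks = [0.0, 0.25, 0.5, 0.75, 1.0]   # hypothetical quantile break points
labels = ["low", "med_low", "med_high", "high"]
print(cut([0.0, 0.3, 0.6, 0.9], breaks, labels))
```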
# removing the old crime rate variable and adding the new quantile of 'crime' variable
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
# dividing the data into training and testing group
n <- nrow(boston_scaled)
ind <- sample(n, size = n * 0.8)
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]
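The split above samples 80 % of the row indices for the training set and keeps the remainder for testing. A Python sketch of the same idea (the seed is arbitrary; the actual chunk uses R's sample()):

```python
import random

random.seed(123)                          # arbitrary seed, for reproducibility
n = 506                                   # rows in the Boston data
train_idx = set(random.sample(range(n), int(n * 0.8)))
test_idx = [i for i in range(n) if i not in train_idx]
print(len(train_idx), len(test_idx))      # 404 102
```

This matches the sizes seen later: the training data has 404 observations and the test data 102.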
lda.fit <- lda(crime ~ . , data = train)
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0,
         x1 = myscale * heads[,choices[1]],
         y1 = myscale * heads[,choices[2]], col = color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads),
       cex = tex, col = color, pos = 3)
}
classes <- as.numeric(train$crime)
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 2)
In the linear discriminant analysis plot, the high crime rate group is clearly separated from the low crime rate group and from most of the medium-high crime rate group. rad (index of accessibility to radial highways) is the most discriminating variable, with larger discriminant coefficients (a longer arrow) than the other variables. The angles between arrows reflect the relationships between variables. With 13 predictor variables in the training data it is hard to distinguish all the differences among them, but it can be concluded that most variables are correlated with each other.
correct_classes <- test$crime
test <- dplyr::select(test, -crime)
str(test)
## 'data.frame': 102 obs. of 13 variables:
## $ zn : num -0.4872 0.0487 0.0487 -0.4872 -0.4872 ...
## $ indus : num -1.306 -0.476 -0.476 -0.437 -0.437 ...
## $ chas : num -0.272 -0.272 -0.272 -0.272 -0.272 ...
## $ nox : num -0.834 -0.265 -0.265 -0.144 -0.144 ...
## $ rm : num 1.015 -0.392 -0.563 -0.641 -0.794 ...
## $ age : num -0.8091 0.5089 -1.0507 -0.429 0.0329 ...
## $ dis : num 1.076671 1.154792 0.786365 0.334119 0.000692 ...
## $ rad : num -0.752 -0.522 -0.522 -0.637 -0.637 ...
## $ tax : num -1.105 -0.577 -0.577 -0.601 -0.601 ...
## $ ptratio: num 0.113 -1.504 -1.504 1.175 1.175 ...
## $ black : num 0.416 0.441 0.371 0.427 0.375 ...
## $ lstat : num -1.3602 0.0864 0.4281 -0.5858 -0.1923 ...
## $ medv : num 1.1816 -0.395 -0.0906 -0.2863 -0.4711 ...
lda.pred <- predict(lda.fit, newdata = test)
table(correct = correct_classes, predicted = lda.pred$class) %>% prop.table() %>% addmargins()
## predicted
## correct low med_low med_high high Sum
## low 0.098039216 0.117647059 0.029411765 0.000000000 0.245098039
## med_low 0.039215686 0.186274510 0.078431373 0.000000000 0.303921569
## med_high 0.000000000 0.088235294 0.127450980 0.009803922 0.225490196
## high 0.000000000 0.000000000 0.009803922 0.215686275 0.225490196
## Sum 0.137254902 0.392156863 0.245098039 0.225490196 1.000000000
According to the prediction results, the accuracy of the model is 0.627 (0.098 + 0.186 + 0.127 + 0.216), the sum of the diagonal of the cross-tabulation above.
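The accuracy is the sum of the diagonal of the proportion table, i.e. the share of observations whose predicted class matches the correct one. Sketched in Python with the proportions from the table above:

```python
# Confusion matrix of proportions (rows: correct, columns: predicted),
# copied from the cross-tabulation above.
table = [
    [0.098039216, 0.117647059, 0.029411765, 0.000000000],
    [0.039215686, 0.186274510, 0.078431373, 0.000000000],
    [0.000000000, 0.088235294, 0.127450980, 0.009803922],
    [0.000000000, 0.000000000, 0.009803922, 0.215686275],
]
accuracy = sum(table[i][i] for i in range(4))
print(round(accuracy, 3))  # 0.627
```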
data('Boston')
re_scaled <- data.frame(scale(Boston))
dist_eu <- dist(re_scaled)
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
Euclidean distance is the most common way to measure distances between observations, so it is applied here. After measuring distances between the scaled observations, the minimum distance is 0.1343 while the maximum is 14.3970; the median and mean distances are 4.8241 and 4.9111, respectively.
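For reference, dist() with its default method computes the Euclidean distance \(d(p,q)=\sqrt{\sum_k (p_k-q_k)^2}\) between each pair of (scaled) observations. A minimal Python sketch of that formula:

```python
# Euclidean distance between two points given as coordinate lists.
def euclidean(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

print(euclidean([0.0, 0.0], [3.0, 4.0]))  # 5.0, the classic 3-4-5 triangle
```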
# determining the numbers of K-means
set.seed(123)
k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(re_scaled, k)$tot.withinss})
# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')
The visualization shows that the total within-cluster sum of squares drops radically up to k = 3. Thus, k = 3 is selected for the k-means clustering.
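The quantity on the y-axis, tot.withinss, is the total within-cluster sum of squares: for each cluster, the squared distances of its members to the cluster mean, summed over all clusters. A one-dimensional Python sketch with a hypothetical cluster assignment:

```python
# Total within-cluster sum of squares for 1-D points and a given assignment.
def twcss(points, assignment, k):
    total = 0.0
    for c in range(k):
        members = [p for p, a in zip(points, assignment) if a == c]
        m = sum(members) / len(members)                  # cluster mean
        total += sum((p - m) ** 2 for p in members)      # within-cluster SS
    return total

pts = [1.0, 2.0, 9.0, 10.0]          # two obvious groups
print(twcss(pts, [0, 0, 1, 1], 2))   # 1.0: each cluster contributes 0.5
```

Adding clusters can only decrease this total, which is why the curve falls; the "elbow" where it stops falling sharply suggests the number of clusters.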
# choosing k = 3
km <- kmeans(re_scaled, centers = 3)
cluster <- as.factor(km$cluster)
ggpairs(re_scaled, aes(colour = cluster, alpha = 0.4),
columns = colnames(re_scaled),
upper = "blank",
diag = NULL)
There are three colors in the plot because k = 3. For most variable pairs, the data falls into three separated groups. However, the three centers are not clearly distinguishable in some panels, such as tax vs age and ptratio vs age.
set.seed(123)
k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(Boston, k)$tot.withinss})
# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')
Here the original (un-scaled) Boston data is used directly to find the optimal number of clusters. According to the plot, the trend becomes relatively stable after k = 3.
# choose k number and create a dataset with cluster
km <- kmeans(Boston, centers = 3)
newboston <- Boston
newscaled <- scale(newboston)
newscaled <- data.frame(newscaled)
newscaled$cluster <- km$cluster
# create the training and testing data
n <- nrow(newscaled)
ind <- sample(n, size = n * 0.8)
train1 <- newscaled[ind,]
test1 <- newscaled[-ind,]
correct_classes <- test1$cluster
test1 <- dplyr::select(test1, -cluster)
# lda analysis
lda.fit <- lda(cluster ~ . , data = train1)
newclass <- as.numeric(train1$cluster)
plot(lda.fit, dimen = 2, col = newclass, pch = newclass)
lda.arrows(lda.fit, myscale = 1)
In the figure above, the three cluster groups are clearly separated. tax and rad have the longest arrows, which means they are the most influential linear separators for the clusters.
model_predictors <- dplyr::select(train, -crime)
lda.fit <- lda(crime ~ . , data = train)
km <-kmeans(model_predictors, centers = 3)
dim(model_predictors)
## [1] 404 13
dim(lda.fit$scaling)
## [1] 13 3
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = train$crime)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = km$cluster)
From the two plots, there are two clear clusters separated from each other. When crime is used as the indicator, the high crime rate group is clustered together with a few medium-high crime rate cases, while most of the medium-high group is clustered with the other groups. This suggests that the factors behind a high crime rate differ from those behind the other crime-rate levels. Because k = 3 was selected in the previous optimization of the number of clusters, the model groups the data into three clusters according to their correlations. However, some points that appear closer to clusters 1 and 2 in the plot are statistically closer to cluster 3.
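For clarity, the 3D coordinates plotted above come from a plain matrix product: each 13-dimensional observation is projected onto LD1–LD3 by the 13 × 3 scaling matrix, so (404 × 13) times (13 × 3) gives 404 points in 3 dimensions. A Python sketch of that projection with tiny illustrative matrices (an identity "scaling" matrix, not the real LDA coefficients):

```python
# Matrix product: (rows x inner) times (inner x cols) -> (rows x cols).
def matmul(A, B):
    rows, inner, cols = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(rows)]

A = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # 3 observations, 2 predictors
W = [[1.0, 0.0], [0.0, 1.0]]               # identity "scaling", for illustration
P = matmul(A, W)                           # one projected row per observation
print(len(P), len(P[0]))                   # 3 2
```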
(more chapters to be added similarly as we proceed with the course!)